#packs for data processing
library(rjson) # to read .JSON files.
library(tidyr) # to process data
library(dplyr) # to process data
library(purrr) # to process data
library(lubridate) # to deal with date variables
#packs for data viz
library(sp) # a pack for spatial objects
library(leaflet) # map and its functions
Introduction to the series: “Visualisation of My Personal Google Data”
If you give Google permission to store your location data, they will keep it in their databases forever. You can also allow them to store it for a while and then ask them to delete it, which they will do right away.
What makes this study a fun project is that it is very personal. I decided to analyze my personal data in August 2022, so I granted Google many new permissions on top of the previously granted ones. Google keeps the data in various formats, including .csv, .json and .mbox. When you request your personal data, they provide it within a couple of days, depending on the size of the data you queried.
I usually provide readers with the data used in my posts. However, in this series the data are very personal, so I will not.
Introduction: “My Locations”
In this part of the series, we will investigate my personal location data by visualizing the spots I visited within a given period. This way, I will personally gain some insight into how boring my days are :)
The R packages that we use in this post are as follows: rjson, tidyr, dplyr, purrr, lubridate, sp and leaflet.
Understand the Data
Inside the Takeout folder that I received from Google, there is a folder named “Location History”. Inside it, “Semantic Location History” contains the location data organized by month and year. From that folder, I picked the locations I visited in November; thus, we will use the 2022_NOVEMBER.json file. Let’s investigate the data, starting with reading the file into the R environment.
my_locations <- fromJSON(file = "2022_NOVEMBER.json")
Then, let’s try to understand the structure of the data, how and what kind of information is stored into its cells. The list object my_locations
contains many lists inside it. Let’s try to understand each one of them one by one:
summary(my_locations[[1]])
Length Class Mode
[1,] 1 -none- list
[2,] 1 -none- list
[3,] 1 -none- list
[4,] 1 -none- list
[5,] 1 -none- list
[6,] 1 -none- list
[7,] 1 -none- list
[8,] 1 -none- list
[9,] 1 -none- list
[10,] 1 -none- list
[11,] 1 -none- list
[12,] 1 -none- list
[13,] 1 -none- list
[14,] 1 -none- list
[15,] 1 -none- list
[16,] 1 -none- list
[17,] 1 -none- list
There are many smaller lists inside this first list. Let’s take the first one and see what’s inside:
summary(my_locations[[1]][[1]])
Length Class Mode
placeVisit 11 -none- list
There is only a single list inside. Sad :( Let’s dive one level deeper:
summary(my_locations[[1]][[1]][[1]])
Length Class Mode
location 8 -none- list
duration 2 -none- list
placeConfidence 1 -none- character
centerLatE7 1 -none- numeric
centerLngE7 1 -none- numeric
visitConfidence 1 -none- numeric
otherCandidateLocations 4 -none- list
editConfirmationStatus 1 -none- character
locationConfidence 1 -none- numeric
placeVisitType 1 -none- character
placeVisitImportance 1 -none- character
Finally, here we have several items. There is a list called location containing 8 items, duration with 2 items, and otherCandidateLocations with 4 items. The other elements hold a single item each. Let’s check these one by one:
summary(my_locations[[1]][[1]][[1]]$location)
Length Class Mode
latitudeE7 1 -none- numeric
longitudeE7 1 -none- numeric
placeId 1 -none- character
address 1 -none- character
semanticType 1 -none- character
sourceInfo 1 -none- list
locationConfidence 1 -none- numeric
calibratedProbability 1 -none- numeric
summary(my_locations[[1]][[1]][[1]]$duration)
Length Class Mode
startTimestamp 1 -none- character
endTimestamp 1 -none- character
summary(my_locations[[1]][[1]][[1]]$otherCandidateLocations)
Length Class Mode
[1,] 7 -none- list
[2,] 7 -none- list
[3,] 7 -none- list
[4,] 7 -none- list
We can obtain a lot of information through this investigation process. For instance, inside location I can see the latitude, longitude, address, the confidence assigned to the place, and more. If you are following along with me, please spare some time to understand your own data: delve into it and digest as much information as you can. I will see you in the next section: pre-processing.
Pre-processing
You can use as many items as you want in your own work; decide which pieces of information are meaningful while exploring your data. Now let’s re-define our nested lists as a dataframe.
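A minimal sketch of that conversion, using purrr and dplyr; I’ll call the result my_locations_df (a name of my choosing), and your Takeout file may be shaped slightly differently. The type.convert() call turns the all-character columns back into numbers where possible:
#flatten every timeline object into one wide row, then stack the rows:
my_locations_df <- my_locations[[1]] %>%
  map(unlist) %>%
  map(~ as.data.frame(t(.x), stringsAsFactors = FALSE)) %>%
  bind_rows() %>%
  type.convert(as.is = TRUE)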
There are many columns, some of which I won’t need. In particular, I am not interested in the locations labelled as candidates, so I will exclude them from this study. They are probably the alternative places I might have visited, ordered by likelihood. I just need the one with the highest likelihood, which is stored under placeVisit.location. These locations are also tagged as “HIGH CONFIDENCE”. Let’s continue the analysis with these locations only.
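A sketch of that filter, assuming the flattened column is named placeVisit.placeConfidence and the tag is spelled HIGH_CONFIDENCE, as it typically is in Takeout exports:
#keep only the visits that Google marked as high confidence:
my_locations_df <- filter(my_locations_df, placeVisit.placeConfidence == "HIGH_CONFIDENCE")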
Also, there are some columns with no entries at all. Let me exclude them with a function called not_all_na, which drops every column that is completely empty.
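A minimal version of that helper could look like this (my_locations_df is carried over from the steps above):
#TRUE for columns holding at least one non-NA value:
not_all_na <- function(x) any(!is.na(x))
#keep only those columns:
my_locations_df <- select(my_locations_df, where(not_all_na))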
Now, I have a dataframe with 150+ columns. However, I just need the latitude, longitude, date and address of the locations I visited. Let’s write a query to get this data into new dataframes:
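A sketch of such a query with dplyr’s contains() helper, which matches column names by substring. It again assumes my_locations_df from above; the object names lat, lon, address and date are the ones used in the rest of the post:
lat     <- select(my_locations_df, contains("placeVisit.location.latitudeE7"))
lon     <- select(my_locations_df, contains("placeVisit.location.longitudeE7"))
address <- select(my_locations_df, contains("placeVisit.location.address"))
date    <- select(my_locations_df, contains("placeVisit.duration.startTimestamp"))
colnames(date) <- "Date" #a friendlier column name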
The chunks of code above ask for the columns whose names contain the strings written in quotation marks. Still, this raw information isn’t enough, for a couple of reasons. First, lat and lon hold coordinates in E7 format; a quick search on the internet shows that they simply need to be divided by 10000000 (10^7). Second, date packs day, month, year, hour, minute, second and time-zone information (the timestamps are in GMT+0) into a single column, and this needs to be handled. Let’s start with the second issue (the one about date):
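First, a peek at the raw timestamps in the date dataframe built above:
head(date)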
Date
1 2022-11-06T13:12:38.091Z
2 2022-11-06T13:24:32.338Z
3 2022-11-06T13:59:00.539Z
4 2022-11-06T14:17:15.462Z
5 2022-11-06T14:39:29.138Z
6 2022-11-07T05:32:14.277Z
As can be seen above, there are two separators: “T” separates the day from the time, and “.” separates the time from the milliseconds-plus-“Z” (UTC) suffix. Follow the notes in the code to grasp the process:
#split the day-and-time info from the fractional-seconds/UTC suffix, then drop the suffix:
date <-
separate(
data = date,
col = Date,
into = c("Date", "zone"),
sep = "\\."
)
date <- date[-c(2)] #drop the "zone" column
#Now, parse the timestamps as UTC and shift them to my local time zone, GMT+3:
date$Date <- as.POSIXct(date$Date, format = "%Y-%m-%dT%H:%M:%S", tz = "UTC") + hours(3)
#divide the day and hour info:
date <-
separate(
data = date,
col = Date,
into = c("Day", "Hour"),
sep = " "
)
#see the new format:
head(date)
Day Hour
1 2022-11-06 16:12:38
2 2022-11-06 16:24:32
3 2022-11-06 16:59:00
4 2022-11-06 17:17:15
5 2022-11-06 17:39:29
6 2022-11-07 08:32:14
Nicely done! Now let’s gather all the information we need into a single dataframe. Again, follow along with the notes in the code:
coords <-
drop_na(data.frame(
lat = unlist(lat, use.names = FALSE) / 10000000, #divide lat and lon by 10000000 to get rid of the E7 format
lon = unlist(lon, use.names = FALSE) / 10000000,
address = unlist(address, use.names = FALSE),
date # we processed this before
))
So far, we have been preparing for the data visualization. Our data is ready under the name coords. Let’s continue with the visualization.
Data Visualization
At this point, we will visualize the locations I visited in November 2022 on a world map. You can’t be as disappointed as I am to see that I live a life between home and work. Still, the point here is the visualization process itself. We owe this beautiful map to the R package leaflet. Leaflet is actually a JavaScript library, but all of its functionality is exposed in the R environment, so we can work with it from R. If you are still with me, I highly recommend reading the documentation of the leaflet package. Then follow along with the notes in the code, and try to understand it if you are not familiar with it.
#turn the dataframe into a SpatialPointsDataFrame (from the sp package):
coordinates(coords) <- ~ lon + lat
leaflet(coords,
# formatting the outer frame of the map:
width = "800px",
height = "400px",
padding = 10) %>%
addTiles() %>%
#formatting the markers on the map:
addCircleMarkers(
color = "tomato", #my favorite colour
fillOpacity = 1,
radius = 7,
stroke = FALSE,
#address pops up when you click on a marker:
popup = coords$address,
#the date and hour show up with a fancy personal note when you hover over a marker:
label = paste0("I have been around here on ", coords$Day, " at around ", coords$Hour),
#formatting the label that shows up when you hover:
labelOptions = labelOptions(
noHide = FALSE,
direction = "top",
style = list(
"color" = "black",
"font-family" = "calibri", #I love calibri
"box-shadow" = "3px 3px rgba(0,0,0,0.25)",
"font-size" = "12px",
"border-color" = "rgba(0,0,0,0.5)"
)
)
)
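If you want to keep the result, the htmlwidgets package (not loaded at the top of this post) can save it as a standalone HTML file. This sketch assumes you first assigned the whole leaflet pipeline above to an object, say my_map:
library(htmlwidgets)
#save the interactive map so it can be opened in any browser:
saveWidget(my_map, "my_locations_map.html")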
Conclusion
Working with your personal data gives you the opportunity to understand your own habits, likes, dislikes, and maybe your expectations for the future. Here, you can only see my locations from November. When I worked on longer periods, I realized that I need to travel and see new places more often. Even within my own city, a new place is a new vision of life.
Visualizing data in a spatial setting is a new challenge for me. Compared with ordinary graphs and charts, working with maps is obviously more engaging. For visualizing location data on maps, leaflet is an amazing open-source library. There are other options, though, and one deserves a mention here: ggmap. However, to use that package you need an API key obtained from Google. For more information about API keys, visit here. As for the package itself, you can visit the CRAN page of ggmap; under the title “Google Maps API key” you will find the procedure for obtaining a personal API key. It reads as follows:
GOOGLE MAPS API KEY [@ggmap]
A few years ago Google changed its API requirements, and ggmap users are now required to register with Google. From a user’s perspective, there are essentially three ramifications of this:

1. Users must register with Google. You can do this at https://mapsplatform.google.com. While it will require a valid credit card (sorry!), there seems to be a fair bit of free use before you incur charges, and even then the charges are modest for light use.

2. Users must enable the APIs they intend to use. What may appear to ggmap users as one overarching “Google Maps” product is in fact several services that Google provides as geo-related solutions. For example, the Maps Static API provides map images, while the Geocoding API provides geocoding and reverse geocoding services. Apart from the relevant Terms of Service, generally ggmap users don’t need to think about the different services. For example, you just need to remember that get_googlemap() gets maps, geocode() geocodes (with Google, DSK is done), etc., and ggmap handles the queries for you. However, you do need to enable the APIs before you use them. You’ll only need to do that once, and then they’ll be ready for you to use. Enabling the APIs just means clicking a few radio buttons on the Google Maps Platform web interface listed above, so it’s easy.

3. Inside R, after loading the new version of ggmap, you’ll need to provide ggmap with your API key, a hash value (think: a string of gibberish) that authenticates you to Google’s servers. This can be done on a temporary basis with register_google(key = "[your key]") or permanently using register_google(key = "[your key]", write = TRUE) (note: this will overwrite your ~/.Renviron file by replacing/adding the relevant line). If you use the former, know that you’ll need to re-do it every time you reset R.

Your API key is private and unique to you, so be careful not to share it online, for example in a GitHub issue or by saving it in a shared R script file. If you share it inadvertently, just get on Google’s website and regenerate your key; this will retire the old one. Keeping your key private is made a bit easier by ggmap scrubbing the key out of queries by default, so when URLs are shown in your console, they’ll look something like key=xxx. (Read the details section of the register_google() documentation for a bit more info on this point.)
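For completeness, here is a minimal sketch of what the ggmap workflow looks like once a key is registered; the center coordinates below are placeholder values, not taken from my data:
library(ggmap)
register_google(key = "[your key]") #temporary: lasts only for this R session
#fetch a static map around a placeholder point and plot it:
m <- get_googlemap(center = c(lon = 29.0, lat = 41.0), zoom = 11)
ggmap(m)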
Stay tuned!
This series continues with the visualization of my Google Fit data, where we will delve into my exercise habits.